Practical: early outbreak assessment, transmissibility and forecasting

Introduction

On the tidyverse

The tidyverse is a collection of R packages designed for data science. Developed with high standards of coding practices and software development, these packages form a consistent ecosystem to handle and visualise data. In this practical, we will mostly rely on the following packages:

  • dplyr for handling data, making new variables, building summaries, number-crunching

  • tidyr to re-arrange data into ‘tidy’ format

  • ggplot2 to visualise data

  • magrittr for the piping operator (%>%)

Most of these functionalities are summarised in handy cheatsheets. We provide links to the most relevant ones below:


Other packages

Other packages we will use include:

  • rmarkdown for automated report generation

  • rio to read in .xlsx files

Cheatsheets and other resources:

  • the rmarkdown cheatsheet

  • the knitr website documenting many options used in rmarkdown’s settings and code chunks


On the RECONverse

Several RECON packages will be used in the practical:

  • linelist for data cleaning

  • incidence, an older package to build and visualise epidemic curves and perform basic model fitting; it is being replaced by incidence2, but some older packages like earlyR (for estimating the reproduction number) and projections (for short-term forecasting) still rely on it

  • incidence2, a re-implementation of incidence, to build and handle epidemic curves

  • i2extras provides simplified wrappers for estimating growth rates using different GLMs from incidence2 objects

  • epicontacts to visualise and analyse contact tracing data or transmission chains

  • epitrix to estimate delay distributions

  • distcrete to handle delay distributions

  • earlyR for early-stage estimation of the reproduction number

  • projections for short-term forecasting

Note that earlyR and projections will soon be adapted to handle incidence2 objects.

For some of these packages, we recommend using the github (development) version, either because it is more up-to-date, or because they have not been released on CRAN yet.


What we will do today

This practical will use different outbreak data to illustrate how the various packages mentioned above can be combined to explore outbreak data and carry out analyses specific to the outbreak response context. Building on material seen in previous sessions, we will particularly look at:

  • cleaning / standardising ‘dirty’ data
  • building epidemic curves
  • deriving empirical distributions of key delays
  • fitting discretised Gamma distributions to delays using maximum likelihood, and summarising the results
  • visualising and analysing transmission chains / contact tracing data
  • estimating the growth rate and doubling time from epidemic curves
  • estimating the reproduction number from epidemic curves
  • performing short-term forecasting of case incidence

In addition, we strongly recommend storing all analyses in one or several rmarkdown reports.


How this document works

This practical will take you through initial exploratory analyses of different datasets. Some parts will be used merely to illustrate a technical point (e.g. how to count cases by different groups), some will focus more on the epidemiological meaning of a result, and some will do both. These analyses are merely meant as a guide, and are by no means an exhaustive exploration of the data. Be curious: feel free to ask new questions and try to answer them. In all instances, try to first answer your own questions: trial and error is a great way to learn R. But if you feel stuck, do not hesitate to ask questions to the demonstrators: they are here to help.

In this practical, most command lines are provided, but not all. In the few instances where code is not provided, you will typically need to re-use and adapt previous code to produce the result requested. In most cases, copy-paste should be sufficient to reproduce the results.

However, it is important that you understand what the code does. Look at relevant documentation, try tweaking the code, experiment, and do not hesitate to ask questions. Not all examples are essential for illustrating the important points, so take your time, especially if you are new to R. Also note that the solutions provided are by no means the only options for solving a given problem - feel free to explore alternatives and discuss them with the trainers.




A novel EVD outbreak in Ankh, Republic of Morporkia

This practical simulates the early assessment of an Ebola Virus Disease (EVD) outbreak. It introduces various aspects of data analysis of the early stage of an outbreak, including contact tracing data, characterisation of the serial interval, and estimation of the growth rate, doubling time and reproduction number from epidemic curves.

A new EVD outbreak has been notified in the small city of Ankh, located in the Northern, rural district of the Republic of Morporkia. Public Health Morporkia (PHM) is in charge of coordinating the outbreak response, and has contracted you as a consultant in outbreak analytics to inform the response in real time.


Importing data

While a new data update is pending, you have been given the linelist and contact data describing the early stages of the outbreak as xlsx files:

  • phm_evd_linelist_2017-10-27.xlsx: a linelist containing case information up to the 27th October 2017

  • phm_evd_contacts_2017-10-27.xlsx: a list of contacts reported between cases up to the 27th October 2017, where from indicates a potential source of infection, and to the recipient of the contact.

To import these files:

  1. create a data/ folder in your project directory

  2. download and save these files in data/

  3. import the files using here::here to find the path to the data, and rio::import to read the files and import them into R
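Putting these steps together, the import might look like the following sketch (the object names linelist_raw and contacts_raw are our own choices, not prescribed by the practical):

```r
library(rio)   # for import()
library(here)  # for project-relative file paths

# import the linelist and contact data saved in data/
linelist_raw <- import(here("data", "phm_evd_linelist_2017-10-27.xlsx"))
contacts_raw <- import(here("data", "phm_evd_contacts_2017-10-27.xlsx"))
```

Using here::here means the paths will resolve correctly no matter where the rmarkdown report is knitted from.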


Data cleaning

Once imported, the raw data should look like:

Examine these data and try to list existing issues.

What is wrong with the raw data?

The raw data, whilst very small in size, have a number of issues, including:

  • capitalisation: mixture of upper case and lower cases

  • dates: their format is very heterogeneous, so that dates are not readily useable

  • special characters: accents, various separators and trailing / heading white spaces

  • typos / inconsistent coding: gender is sometimes indicated in full words, sometimes abbreviated; location has some obvious abbreviations and typos

We will use the function clean_data() from the linelist package to clean these data. This function will:

  • set all characters to lower case
  • use _ as universal separator
  • remove all heading and trailing separators
  • replace all accented characters by their closest ASCII match
  • detect date formats and convert entries to Date when relevant (see argument guess_dates)
  • replace typos and recode variables (see argument wordlists)

For this last part, we need to load a separate file containing cleaning rules: phm_evd_cleaning_rules.xlsx.

Examine this file (and read the ‘explanations’ tab), import it as the other files, and save the resulting data.frame as an object called cleaning_rules. The output should look like:

##   bad_spelling good_spelling variable
## 1            f        female      sex
## 2            m          male      sex
## 3     .missing       unknown      sex
## 4           am  ankh_morpork location
## 5   peudopolis   pseudopolis location
## 6     .missing       unknown location

We can now clean our data using clean_data; execute and interpret the following commands:
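As a sketch of what these commands might look like (the input names linelist_raw and contacts_raw, and the use of spelling_vars to point at the column of cleaning_rules naming the target variable, are assumptions):

```r
library(linelist)

# clean variable names, dates and spelling; 'wordlists' supplies the
# recoding table, 'spelling_vars' names the column of cleaning_rules
# indicating which variable each rule applies to
linelist_clean <- clean_data(linelist_raw,
                             wordlists = cleaning_rules,
                             spelling_vars = "variable")
contacts_clean <- clean_data(contacts_raw)
```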


Looking at transmission chains


Exercise: contact tracing is at the centre of an Ebola outbreak response. Using the same approach seen in the previous practical, build an epicontacts object using the function make_epicontacts, and store the results as x.
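A possible solution, assuming the cleaned datasets are called linelist_clean and contacts_clean:

```r
library(epicontacts)

# combine the linelist and the contacts into a single epicontacts object;
# contacts are directed, from a potential source case to a recipient
x <- make_epicontacts(linelist = linelist_clean,
                      contacts = contacts_clean,
                      directed = TRUE)
x
```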


## 
## /// Epidemiological Contacts //
## 
##   // class: epicontacts
##   // 12 cases in linelist; 11 contacts;  directed 
## 
##   // linelist
## 
## # A tibble: 12 × 6
##    id     date_of_onset date_report sex       age location    
##    <chr>  <date>        <date>      <chr>   <dbl> <chr>       
##  1 39e9dc 2017-10-10    2017-10-22  female     62 pseudopolis 
##  2 664549 2017-10-16    2017-10-24  male       28 pseudopolis 
##  3 b4d8aa 2017-10-17    2017-10-23  male       54 ankh_morpork
##  4 51883d 2017-10-18    2017-10-22  male       57 pseudopolis 
##  5 947e40 2017-10-20    2017-10-25  female     23 ankh_morpork
##  6 9aa197 2017-10-20    2017-10-23  female     66 ankh_morpork
##  7 e4b0a2 2017-10-21    2017-10-24  female     13 ankh_morpork
##  8 af0ac0 2017-10-21    2017-10-26  male       10 ankh_morpork
##  9 185911 2017-10-21    2017-10-26  female     34 ankh_morpork
## 10 601d2e 2017-10-22    2017-10-28  unknown    11 ankh_morpork
## 11 605322 2017-10-22    2017-10-28  female     23 ankh_morpork
## 12 e399b1 2017-10-23    2017-10-28  female     23 ankh_morpork
## 
##   // contacts
## 
## # A tibble: 11 × 2
##    from   to    
##    <chr>  <chr> 
##  1 51883d 185911
##  2 b4d8aa e4b0a2
##  3 39e9dc b4d8aa
##  4 39e9dc 601d2e
##  5 51883d 9aa197
##  6 39e9dc 51883d
##  7 39e9dc e399b1
##  8 b4d8aa af0ac0
##  9 39e9dc 947e40
## 10 39e9dc 664549
## 11 39e9dc 605322

You can easily plot these contacts, but with a little bit of tweaking (see ?vis_epicontacts) you can customise shapes by gender:
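For instance (the shape names below refer to Font Awesome icons used by vis_epicontacts; this particular mapping is one possible choice):

```r
# default interactive plot of the transmission chains
plot(x)

# customise node shapes according to gender
plot(x,
     node_shape = "sex",
     shapes = c(male = "male",
                female = "female",
                unknown = "question-circle"))
```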


Question: What can you say about these contacts? Is there evidence of super-spreading? Can you estimate it at this stage?



Looking at incidence curves

The first question PHM asks you is simply: how bad is it? Given that this is a terrible disease, with a mortality rate nearing 70%, there is a lot of concern about this outbreak getting out of control. The first step of the analysis lies in drawing an epicurve, i.e. a plot of incidence over time.


In principle, the newer package incidence2 should be used to build epidemic curves. However, here we shall use its older sibling incidence, which provides some facilities for estimating growth rates that we will use later.

We compute daily incidence based on the dates of symptom onset, storing results in an object called i:
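Assuming the cleaned linelist is called linelist_clean, this might look like:

```r
library(incidence)

# daily incidence computed from dates of symptom onset
i <- incidence(linelist_clean$date_of_onset)
i
plot(i)
```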


Exercise: If you pay close attention to the dates on the x-axis, you may notice that something is missing. Indeed, the graph stops right after the last case, but field epidemiologists from PHM insist that the data should be complete until the 27th October 2017. Try to fix this using the argument last_date in the incidence function.
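A possible fix, telling incidence that the data are complete up to 27 October 2017:

```r
# extend the time window to the date of the last database update,
# so trailing days with zero incidence are included
i <- incidence(linelist_clean$date_of_onset,
               last_date = as.Date("2017-10-27"))
plot(i)
```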


## <incidence object>
## [12 cases from days 2017-10-10 to 2017-10-27]
## 
## $counts: matrix with 18 rows and 1 columns
## $n: 12 cases in total
## $dates: 18 dates marking the left-side of bins
## $interval: 1 day
## $timespan: 18 days
## $cumulative: FALSE


Estimating the growth rate (r)

The simplest model of incidence is probably the log-linear model, i.e. a linear regression on log-transformed incidences. In the incidence package, the function fit will estimate the parameters of this model from an incidence object (here, i), and derive estimates of doubling / halving times and associated confidence intervals:
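A sketch of the model fitting:

```r
# fit a log-linear model to the daily incidence
f <- fit(i)
f                 # prints r, doubling time, and confidence intervals
plot(i, fit = f)  # overlay the fitted model on the epicurve
```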


Question: How would you interpret this result? What criticisms would you make of this model?



Estimating the reproduction number (R)

Branching process models can be used to estimate the reproduction number (R), given incidence data and a serial interval distribution. As you have already built an incidence object, the only missing information relates to the serial interval distribution. As current data are insufficient to estimate it, you can use previously published estimates with a mean of 15.3 days and a standard deviation of 9.3 days (WHO Ebola Response Team 2014).


Exercise: Look at the documentation of the function get_R from the earlyR package, and identify the arguments corresponding to the incidence data, and the mean and standard deviation of the serial interval. Run the function with the correct arguments, and store the output as a new object called R; results should resemble:
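A possible solution, using the published serial interval estimates:

```r
library(earlyR)

# estimate R from the incidence object and the serial interval parameters
R <- get_R(i, si_mean = 15.3, si_sd = 9.3)
R
```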


You can visualise the results as follows:
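For instance:

```r
plot(R)             # distribution of likely values of R, with the ML estimate
plot(R, "lambdas")  # global force of infection over time
```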

The first figure shows the distribution of likely values of R, along with its maximum-likelihood (ML) estimate. The second figure shows the global force of infection over time, with dashed bars indicating dates of onset of the cases.


Question: Interpret these results: what do you make of the reproduction number? What does it reflect? Based on the last part of the epicurve, some colleagues suggest that incidence is going down and the outbreak may be under control. What is your opinion on this?



Short-term forecasting

The function project from the package projections can be used to simulate plausible epidemic trajectories by simulating daily incidence using the same branching process as the one used to estimate \(R_0\) in earlyR. All that is needed is one or several values of \(R_0\) and a serial interval distribution, stored as a distcrete object. As this one is included in the output of earlyR, we have all the necessary inputs to carry out these simulations.

Here, we illustrate how we can simulate 5 random trajectories using a fixed value of \(R_0\) = 3.02, the ML estimate of \(R_0\):
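A sketch of these simulations (the 2-week horizon given by n_days is our own choice; the serial interval distribution is taken from the earlyR output):

```r
library(projections)

# simulate 5 plausible trajectories using the ML estimate of R and the
# serial interval stored in the earlyR output (a distcrete object)
proj <- project(i, R = 3.02, si = R$si, n_sim = 5, n_days = 14)
plot(proj)
```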


Exercise: Read the commands below and interpret the results.



Question: According to these results, what are the chances that more cases will appear in the near future? Is this outbreak being brought under control? Would you recommend scaling up / down the response?




COVID-19 in the UK

For the second part of this practical, we look at real data reporting COVID-19 hospitalisations in England, broken down by hospital and National Health Service (NHS) region. The data are available online from NHS England’s website. The dataset analysed here is a simplified version, providing incidence of hospital admissions by NHS trust, stored in the file covid_admissions_uk_2020_10_24.xlsx.


Distribution of cases by NHS region

Here, we try to get an overall picture of the distribution of hospital admissions across England. The following code sums the number of cases reported by region:
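As a sketch (the object and column names dat, region and n are assumptions about how the data were imported; check names(dat) against your own import):

```r
library(dplyr)

# total number of admissions reported by NHS region
dat %>%
  group_by(region) %>%
  summarise(total_cases = sum(n)) %>%
  arrange(desc(total_cases))
```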


Exercise: Adapt this code to derive total number of cases by region and NHS trust, and provide a visual representation of the resulting numbers and interpret the results; your plot may resemble:



Epidemic curves

Unlike linelists of cases, this dataset contains pre-calculated counts by date, NHS region and trust. While incidence2 is primarily designed to handle linelist-like data, it can also convert pre-computed incidences using the count argument:
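For instance (the column names date, region and n are assumptions, and the exact name of the counts argument may differ between incidence2 versions - check ?incidence2::incidence):

```r
library(incidence2)

# build daily incidence by region from pre-computed counts
covid_i <- incidence(dat,
                     date_index = "date",
                     groups = "region",
                     count = "n")
```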


Exercise: Use facet_plot, tweaking the arguments angle, format and the one specifying the number of dates on the x-axis, to reproduce the plot below; which do you think is best at representing dynamics of COVID-19 in England?



Growth rates

The new package i2extras adds a number of tools for incidence2 objects, including the implementation of different GLMs to fit epicurves, and derive growth rates.

To illustrate these approaches, we first select the data so that:

  • we use daily incidences by region
  • we keep the last 4 weeks of data
  • we exclude the last week of data, as it may be prone to biases due to reporting delays
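These selection steps might be sketched as follows (assuming the incidence object carries a date_index column, as produced by incidence2; the exact column name may differ):

```r
library(dplyr)

# keep the last 4 weeks of data, excluding the most recent week,
# which may be biased by reporting delays
last_weeks <- covid_i %>%
  filter(date_index > max(date_index) - 28,
         date_index <= max(date_index) - 7)
```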


Exercise: We can now fit models to these data. To do so, try doing the following:

  1. load the package i2extras; if not installed, follow instructions from the package’s website

  2. use the function fit_curve() to fit a negative binomial model to the incidence data by region, save the output as a new object last_trends, and plot it

  3. feed this output to the function growth_rate() to obtain estimates of growth rates, doubling times and their confidence intervals, for each region; the results should resemble:
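A possible solution for steps 2 and 3:

```r
library(i2extras)

# fit a negative binomial GLM to each region's epicurve
last_trends <- fit_curve(last_weeks, model = "negbin")
plot(last_trends)

# derive growth rates and doubling times, with confidence intervals
growth_rate(last_trends)
```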


## # A tibble: 7 × 9
##   count_variable region       data model estimates fitting_warning fitting_error
##   <chr>          <chr>    <list<t> <lis> <list>    <list>          <list>       
## 1 n              East of… [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## 2 n              London   [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## 3 n              Midlands [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## 4 n              North E… [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## 5 n              North W… [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## 6 n              South E… [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## 7 n              South W… [21 × 2] <glm> <df [21 … <NULL>          <NULL>       
## # … with 2 more variables: prediction_warning <list>, prediction_error <list>

## # A tibble: 7 × 10
##   count_variable region       model      r r_lower r_upper growth_or_decay  time
##   <chr>          <chr>        <lis>  <dbl>   <dbl>   <dbl> <chr>           <dbl>
## 1 n              East of Eng… <glm> 0.0495  0.0331  0.0661 doubling        14.0 
## 2 n              London       <glm> 0.0504  0.0395  0.0613 doubling        13.8 
## 3 n              Midlands     <glm> 0.0556  0.0475  0.0638 doubling        12.5 
## 4 n              North East … <glm> 0.0491  0.0426  0.0556 doubling        14.1 
## 5 n              North West   <glm> 0.0596  0.0537  0.0655 doubling        11.6 
## 6 n              South East   <glm> 0.0721  0.0574  0.0871 doubling         9.61
## 7 n              South West   <glm> 0.0948  0.0769  0.113  doubling         7.31
## # … with 2 more variables: time_lower <dbl>, time_upper <dbl>

The code below can be used to visualise the results:
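One way this visualisation might be written, as a ggplot2 sketch (the column names r, r_lower, r_upper and region are taken from the growth_rate output shown above):

```r
library(ggplot2)
library(dplyr)

# point estimates and confidence intervals of r, ordered by magnitude
growth_rate(last_trends) %>%
  ggplot(aes(x = reorder(region, r), y = r,
             ymin = r_lower, ymax = r_upper)) +
  geom_pointrange() +
  coord_flip() +
  labs(x = NULL, y = "Daily growth rate (r)")
```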


Exercise: reproduce the same figure using doubling times instead of growth rates. Taken together, what do these results suggest about the evolution of COVID-19 in the UK over the coming weeks? What would be the doubling time for growth rates (r) close to 0? When would you recommend using doubling times for communicating results on the evolution of an epidemic?




References

WHO Ebola Response Team. 2014. “Ebola Virus Disease in West Africa–the First 9 Months of the Epidemic and Forward Projections.” N. Engl. J. Med. 371 (16): 1481–95.